{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcdd Exercise M1.03\n",
    "\n",
    "The goal of this exercise is to compare the performance of our classifier in\n",
    "the previous notebook (roughly 81% accuracy with `LogisticRegression`) to some\n",
    "simple baseline classifiers. The simplest baseline classifier is one that\n",
    "always predicts the same class, irrespective of the input data.\n",
    "\n",
    "- What would be the score of a model that always predicts `' >50K'`?\n",
    "- What would be the score of a model that always predicts `' <=50K'`?\n",
    "- Is 81% or 82% accuracy a good score for this problem?\n",
    "\n",
    "Use a `DummyClassifier` and do a train-test split to evaluate its accuracy on\n",
    "the test set. This\n",
    "[link](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)\n",
    "shows a few examples of how to evaluate the generalization performance of\n",
    "these baseline models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first split our dataset to have the target separated from the data used to\n",
    "train our predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by selecting only the numerical columns as seen in the previous\n",
    "notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_columns = [\"age\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\n",
    "\n",
    "data_numeric = data[numerical_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the data and target into a train and test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a `DummyClassifier` such that the resulting classifier always predict the\n",
    "class `' >50K'`. What is the accuracy score on the test set? Repeat the\n",
    "experiment by always predicting the class `' <=50K'`.\n",
    "\n",
    "Hint: you can set the `strategy` parameter of the `DummyClassifier` to achieve\n",
    "the desired behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}